We will do a clustering analysis on the iris data set, using the k-means technique and a hierarchical clustering approach

The iris data set is one of R’s built-in data-sets, and is also available at the UCI Machine Learning Repository.

The data set was created by statistician Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.

The data consists of 3 classes of flower types:

The data has 4 attributes:

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

The built-in data set iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

library(tidyverse)

Let us first create a plot using the variables Petal.Width and Petal.Width, coloring every point by their class:

ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
  geom_point()

Using k-means

Let us try to find the 3 groups using k-means

# use kmeans() with the subset of columns we want to consider
iris_mini <- iris[, c("Petal.Width", "Petal.Length")]
km_iris <- kmeans(x = iris_mini, centers = 3)

The object km_iris contains an attribute called cluster that we want to extract to see the different assignments:

str(km_iris)
List of 9
 $ cluster     : int [1:150] 1 1 1 1 1 1 1 1 1 1 ...
 $ centers     : num [1:3, 1:2] 0.246 1.359 2.048 1.462 4.293 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : chr [1:2] "Petal.Width" "Petal.Length"
 $ totss       : num 551
 $ withinss    : num [1:3] 2.02 14.23 15.16
 $ tot.withinss: num 31.4
 $ betweenss   : num 519
 $ size        : int [1:3] 50 54 46
 $ iter        : int 2
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"

Let us plot the points again, now coloring them based on the clusters found by the k-means algorithm:

km_results <- iris_mini %>%
                mutate(cluster = factor(km_iris$cluster))
km_results
# k-means results
km_results %>%
  ggplot() +
  geom_point(aes(x = Petal.Width, y = Petal.Length, color = cluster))

Using hierarchical clustering

We will perform the same task now using hiercarchical clustering. For this we use the function hclust()

# hierarchical clustering with single linkage
dis <- dist(iris_mini)
# use distance matrix as input for clustering method
hc_iris <- hclust(dis, method = "single")
hc_iris

Call:
hclust(d = dis, method = "single")

Cluster method   : single 
Distance         : euclidean 
Number of objects: 150 

Explore the structure of the hc_iris object:

str(hc_iris)
List of 7
 $ merge      : int [1:149, 1:2] -1 -5 -9 -29 -34 -48 -50 -3 -39 -43 ...
 $ height     : num [1:149] 0 0 0 0 0 0 0 0 0 0 ...
 $ order      : int [1:150] 45 25 23 14 44 15 36 24 6 27 ...
 $ labels     : NULL
 $ method     : chr "single"
 $ call       : language hclust(d = dis, method = "single")
 $ dist.method: chr "euclidean"
 - attr(*, "class")= chr "hclust"

To find the cluster assignments in this case, let us cut the dendrogram tree to the proper number of clusters using the cutree() (in our case 3)

# use cutree() to cut the dendrogram
cut3 <- cutree(hc_iris, k = 3)
# save the assignment as a factor
hc_results <- iris_mini %>%
                mutate(cluster = factor(cut3))
hc_results

And now recreate the plot with the assignment provided by the hierarchical clustering technique

hc_results %>%
  ggplot() +
  geom_point(aes(x = Petal.Width, y = Petal.Length, color = cluster))

# hierarchical clustering using complete linkage
dis2 <- dist(iris_mini)
# use distance matrix as input for clustering method
hc_iris2 <- hclust(dis2, method = "complete")
# use cutree() to cut the dendrogram
newcut3 <- cutree(hc_iris2, k = 3)
# save the assignment as a factor
hc_results2 <- iris_mini %>%
                mutate(cluster = factor(newcut3))
# create new plot
hc_results2 %>%
  ggplot() +
  geom_point(aes(x = Petal.Width, y = Petal.Length, color = cluster))

LS0tCnRpdGxlOiAiSy1tZWFucyBhbmQgSGllcmFyY2hpY2FsIENsdXN0ZXJpbmcgb24gdGhlIGBpcmlzYCBEYXRhc2V0IgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpXZSB3aWxsIGRvIGEgY2x1c3RlcmluZyBhbmFseXNpcyBvbiB0aGUgYGlyaXNgIGRhdGEgc2V0LCB1c2luZyB0aGUgay1tZWFucyB0ZWNobmlxdWUgYW5kIGEgaGllcmFyY2hpY2FsIGNsdXN0ZXJpbmcgYXBwcm9hY2ggCgpUaGUgaXJpcyBkYXRhIHNldCBpcyBvbmUgb2YgUidzIGJ1aWx0LWluIGRhdGEtc2V0cywgYW5kIGlzIGFsc28gYXZhaWxhYmxlIGF0IHRoZSBbVUNJIE1hY2hpbmUgTGVhcm5pbmcgUmVwb3NpdG9yeV0oaHR0cDovL2FyY2hpdmUuaWNzLnVjaS5lZHUvbWwvKS4KClRoZSBkYXRhIHNldCB3YXMgY3JlYXRlZCBieSBzdGF0aXN0aWNpYW4gUm9uYWxkIEZpc2hlciBpbiBoaXMgMTkzNiBwYXBlciBUaGUgdXNlIG9mIG11bHRpcGxlIG1lYXN1cmVtZW50cyBpbiB0YXhvbm9taWMgcHJvYmxlbXMgYXMgYW4gZXhhbXBsZSBvZiBsaW5lYXIgZGlzY3JpbWluYW50IGFuYWx5c2lzLiAgCgpUaGUgZGF0YSBjb25zaXN0cyBvZiAqKjMgY2xhc3NlcyoqIG9mIGZsb3dlciB0eXBlczoKCiAgKiBzZXRvc2EKICAKICAqIHZpcmdpbmljYQogIAogICogdmVyc2ljb2xvcgogIApUaGUgZGF0YSBoYXMgKio0IGF0dHJpYnV0ZXMqKjoKICAKICAqIHNlcGFsIHdpZHRoCiAgCiAgKiBzZXBhbCBsZW5ndGgKICAKICAqIHBldGFsIHdpZHRoCiAgCiAgKiBwZXRhbCBsZW5ndGgKCgpUaGlzIGZhbW91cyAoRmlzaGVyJ3Mgb3IgQW5kZXJzb24ncykgaXJpcyBkYXRhIHNldCBnaXZlcyB0aGUgbWVhc3VyZW1lbnRzIGluIGNlbnRpbWV0ZXJzIG9mIHRoZSB2YXJpYWJsZXMgc2VwYWwgbGVuZ3RoIGFuZCB3aWR0aCBhbmQgcGV0YWwgbGVuZ3RoIGFuZCB3aWR0aCwgcmVzcGVjdGl2ZWx5LCBmb3IgNTAgZmxvd2VycyBmcm9tIGVhY2ggb2YgMyBzcGVjaWVzIG9mIGlyaXMuIFRoZSBzcGVjaWVzIGFyZSBJcmlzIHNldG9zYSwgdmVyc2ljb2xvciwgYW5kIHZpcmdpbmljYS4KClRoZSBidWlsdC1pbiBkYXRhIHNldCBgaXJpc2AgaXMgYSBkYXRhIGZyYW1lIHdpdGggMTUwIGNhc2VzIChyb3dzKSBhbmQgNSB2YXJpYWJsZXMgKGNvbHVtbnMpIG5hbWVkIGBTZXBhbC5MZW5ndGhgLCBgU2VwYWwuV2lkdGhgLCBgUGV0YWwuTGVuZ3RoYCwgYFBldGFsLldpZHRoYCwgYW5kIGBTcGVjaWVzYC4KCiFbXShodHRwczovL3MzLmFtYXpvbmF3cy5jb20vYXNzZXRzLmRhdGFjYW1wLmNvbS9ibG9nX2Fzc2V0cy9pcmlzLW1hY2hpbmVsZWFybmluZy5wbmcpCgoKYGBge3J9CmxpYnJhcnkodGlkeXZlcnNlKQpgYGAKCgpMZXQgdXMgZmlyc3QgY3JlYXRlIGEgcGxvdCB1c2luZyB0aGUgdmFyaWFibGVzIGBQZXRhbC5XaWR0aGAgYW5kIGBQZXRhbC5XaWR0aGAsIGNvbG9yaW5nIGV2ZXJ5IHBvaW50IGJ5IHRoZWlyIGNsYXNzOgoKYGBge3J9CmdncGxvdChpcmlzLCBhZXMoeCA9IFBldGFsLldpZHRoLCB5ID0gUGV0YWwuTGVuZ3RoLCBjb2xvciA9IFNwZWNpZXMpKSArCiAgZ2VvbV9wb2ludCgpCmBgYAoKCiMjIFVzaW5nIGstbWVhbnMKCkxldCB1cyB0cnkgdG8gZmluZCB0aGUgMyBncm91cHMgdXNpbmcgay1tZWFucyAKCmBgYHtyfQojIHVzZSBrbWVhbnMoKSB3aXRoIHRoZSBzdWJzZXQgb2YgY29sdW1ucyB3ZSB3YW50IHRvIGNvbnNpZGVyCmlyaXNfbWluaSA8LSBpcmlzWywgYygiUGV0YWwuV2lkdGgiLCAiUGV0YWwuTGVuZ3RoIildCmttX2lyaXMgPC0ga21lYW5zKHggPSBpcmlzX21pbmksIGNlbnRlcnMgPSAzKQpgYGAKClRoZSBvYmplY3QgYGttX2lyaXNgIGNvbnRhaW5zIGFuIGF0dHJpYnV0ZSBjYWxsZWQgYGNsdXN0ZXJgIHRoYXQgd2Ugd2FudCB0byBleHRyYWN0IHRvIHNlZSB0aGUgZGlmZmVyZW50IGFzc2lnbm1lbnRzOiAKCmBgYHtyfQpzdHIoa21faXJpcykKYGBgCgpMZXQgdXMgcGxvdCB0aGUgcG9pbnRzIGFnYWluLCBub3cgY29sb3JpbmcgdGhlbSBiYXNlZCBvbiB0aGUgY2x1c3RlcnMgZm91bmQgYnkgdGhlIGstbWVhbnMgYWxnb3JpdGhtOgoKLSBGaXJzdCBjb252ZXJ0IHRoZSBhc3NpZ21lbnQgdG8gYSBfZmFjdG9yXyB2YXJpYWJsZToKCmBgYHtyfQprbV9yZXN1bHRzIDwtIGlyaXNfbWluaSAlPiUKICAgICAgICAgICAgICAgIG11dGF0ZShjbHVzdGVyID0gZmFjdG9yKGttX2lyaXMkY2x1c3RlcikpCmttX3Jlc3VsdHMKYGBgCgotIE5vdyBwbG90IHRoZSBwb2ludHMgY29sb3JlZCBieSB0aGVpciBjbHVzdGVycyBmcm9tIHRoZSBrLW1lYW5zIHByb2NlZHVyZQoKYGBge3J9CiMgay1tZWFucyByZXN1bHRzCmttX3Jlc3VsdHMgJT4lCiAgZ2dwbG90KCkgKwogIGdlb21fcG9pbnQoYWVzKHggPSBQZXRhbC5XaWR0aCwgeSA9IFBldGFsLkxlbmd0aCwgY29sb3IgPSBjbHVzdGVyKSkKYGBgCgoKIyMgVXNpbmcgaGllcmFyY2hpY2FsIGNsdXN0ZXJpbmcKCgpXZSB3aWxsIHBlcmZvcm0gdGhlIHNhbWUgdGFzayBub3cgdXNpbmcgaGllcmNhcmNoaWNhbCBjbHVzdGVyaW5nLgpGb3IgdGhpcyB3ZSB1c2UgdGhlIGZ1bmN0aW9uIGBoY2x1c3QoKWAgCgpgYGB7cn0KIyBoaWVyYXJjaGljYWwgY2x1c3RlcmluZyB3aXRoIHNpbmdsZSBsaW5rYWdlCmRpcyA8LSBkaXN0KGlyaXNfbWluaSkKIyB1c2UgZGlzdGFuY2UgbWF0cml4IGFzIGlucHV0IGZvciBjbHVzdGVyaW5nIG1ldGhvZApoY19pcmlzIDwtIGhjbHVzdChkaXMsIG1ldGhvZCA9ICJzaW5nbGUiKQpoY19pcmlzCmBgYAoKRXhwbG9yZSB0aGUgc3RydWN0dXJlIG9mIHRoZSBgaGNfaXJpc2Agb2JqZWN0OgoKYGBge3J9CnN0cihoY19pcmlzKQpgYGAKClRvIGZpbmQgdGhlIGNsdXN0ZXIgYXNzaWdubWVudHMgaW4gdGhpcyBjYXNlLCBsZXQgdXMgY3V0IHRoZSBkZW5kcm9ncmFtIHRyZWUgdG8gdGhlIHByb3BlciBudW1iZXIgb2YgY2x1c3RlcnMgdXNpbmcgdGhlIGBjdXRyZWUoKWAgKGluIG91ciBjYXNlIDMpCgpgYGB7cn0KIyB1c2UgY3V0cmVlKCkgdG8gY3V0IHRoZSBkZW5kcm9ncmFtCmN1dDMgPC0gY3V0cmVlKGhjX2lyaXMsIGsgPSAzKQojIHNhdmUgdGhlIGFzc2lnbm1lbnQgYXMgYSBmYWN0b3IKaGNfcmVzdWx0cyA8LSBpcmlzX21pbmkgJT4lCiAgICAgICAgICAgICAgICBtdXRhdGUoY2x1c3RlciA9IGZhY3RvcihjdXQzKSkKaGNfcmVzdWx0cwpgYGAKCkFuZCBub3cgcmVjcmVhdGUgdGhlIHBsb3Qgd2l0aCB0aGUgYXNzaWdubWVudCBwcm92aWRlZCBieSB0aGUgaGllcmFyY2hpY2FsIGNsdXN0ZXJpbmcgdGVjaG5pcXVlCgpgYGB7cn0KaGNfcmVzdWx0cyAlPiUKICBnZ3Bsb3QoKSArCiAgZ2VvbV9wb2ludChhZXMoeCA9IFBldGFsLldpZHRoLCB5ID0gUGV0YWwuTGVuZ3RoLCBjb2xvciA9IGNsdXN0ZXIpKQpgYGAKCi0gTWF5YmUgaWYgd2UgdHJ5IGEgZGlmZmVyZW50IF9saW5rYWdlXyB3ZSBjYW4gb2J0YWluIGJldHRlciByZXN1bHRzOiAKCgpgYGB7cn0KIyBoaWVyYXJjaGljYWwgY2x1c3RlcmluZyB1c2luZyBjb21wbGV0ZSBsaW5rYWdlCmRpczIgPC0gZGlzdChpcmlzX21pbmkpCiMgdXNlIGRpc3RhbmNlIG1hdHJpeCBhcyBpbnB1dCBmb3IgY2x1c3RlcmluZyBtZXRob2QKaGNfaXJpczIgPC0gaGNsdXN0KGRpczIsIG1ldGhvZCA9ICJjb21wbGV0ZSIpCmBgYAoKCi0gR2V0IG5ldyBhc3NpZ25tZW50cyBmcm9tIHRoZSBjb21wbGV0ZSBsaW5rYWdlIG9wdGlvbgoKYGBge3J9CiMgdXNlIGN1dHJlZSgpIHRvIGN1dCB0aGUgZGVuZHJvZ3JhbQpuZXdjdXQzIDwtIGN1dHJlZShoY19pcmlzMiwgayA9IDMpCiMgc2F2ZSB0aGUgYXNzaWdubWVudCBhcyBhIGZhY3RvcgpoY19yZXN1bHRzMiA8LSBpcmlzX21pbmkgJT4lCiAgICAgICAgICAgICAgICBtdXRhdGUoY2x1c3RlciA9IGZhY3RvcihuZXdjdXQzKSkKYGBgCgoKLSBBbmQgbm93IHJlY3JlYXRlIHRoZSBwbG90IHdpdGggdGhlIGFzc2lnbm1lbnQgcHJvdmlkZWQgYnkgdGhlIGhpZXJhcmNoaWNhbCBjbHVzdGVyaW5nIHRlY2huaXF1ZQoKYGBge3J9CiMgY3JlYXRlIG5ldyBwbG90CmhjX3Jlc3VsdHMyICU+JQogIGdncGxvdCgpICsKICBnZW9tX3BvaW50KGFlcyh4ID0gUGV0YWwuV2lkdGgsIHkgPSBQZXRhbC5MZW5ndGgsIGNvbG9yID0gY2x1c3RlcikpCmBgYAoKCg==